Search results for "massive datasets"

Showing 3 of 3 documents

Algorithmic paradigms for stability-based cluster validity and model selection statistical methods, with applications to microarray data analysis

2012

The advent of high throughput technologies, in particular microarrays, for biological research has revived interest in clustering, resulting in a plethora of new clustering algorithms. However, model selection, i.e., the identification of the correct number of clusters in a dataset, has received relatively little attention. Indeed, although central for statistics, its difficulty is also well known. Fortunately, a few novel techniques for model selection, representing a sharp departure from previous ones in statistics, have been proposed and gained prominence for microarray data analysis. Among those, the stability-based methods are the most robust and best performing in terms of pre…

Settore INF/01 - Informatica; General Computer Science; business.industry; Computer science; Bioinformatics; Model selection; General statistics; Machine learning; computer.software_genre; Theoretical Computer Science; Computational biology; Analysis of massive datasets; Cluster (physics); Algorithms and data structures; Algorithm design; Artificial intelligence; Cluster analysis; business; Completeness (statistics); computer; Computer Science (all)
researchProduct

Lightweight LCP construction for next-generation sequencing datasets

2012

The advent of "next-generation" DNA sequencing (NGS) technologies has meant that collections of hundreds of millions of DNA sequences are now commonplace in bioinformatics. Knowing the longest common prefix array (LCP) of such a collection would facilitate the rapid computation of maximal exact matches, shortest unique substrings and shortest absent words. CPU-efficient algorithms for computing the LCP of a string have been described in the literature, but require the presence in RAM of large data structures. This prevents such methods from being feasible for NGS datasets. In this paper we propose the first lightweight method that simultaneously computes, via sequential scans, the LCP and B…

Whole genome sequencing; Genomics (q-bio.GN); FOS: Computer and information sciences; Sequence; BWT; LCP; text indexes; next-generation sequencing datasets; massive datasets; Settore INF/01 - Informatica; Computer science; Computation; String (computer science); LCP array; Parallel computing; Data structure; DNA sequencing; Substring; FOS: Biological sciences; Computer Science - Data Structures and Algorithms; Quantitative Biology - Genomics; Data Structures and Algorithms (cs.DS)
researchProduct

Towards A Twitter Observatory: A Multi-Paradigm Framework For Collecting, Storing And Analysing Tweets

2016

In this article we show how a multi-paradigm framework can fulfil the requirements of tweet analysis and reduce the waiting time for researchers who use computational resources and storage systems to support large-scale data analysis. The originality of our approach is to combine concerns about data harvesting, data storage, data analysis and data visualisation into a framework that supports inductive reasoning in multidisciplinary scientific research. Our main contribution is a polyglot storage system with a generic data model to support logical data independence and a set of tools that can provide a suitable solution for mixing different types of algorithms in or…

[INFO] Computer Science [cs]; [INFO.INFO-IR] Computer Science [cs]/Information Retrieval [cs.IR]; [INFO.INFO-SI] Computer Science [cs]/Social and Information Networks [cs.SI]; [INFO.INFO-IT] Computer Science [cs]/Information Theory [cs.IT]; [INFO.INFO-DB] Computer Science [cs]/Databases [cs.DB]; [SPI.TRON] Engineering Sciences [physics]/Electronics; Computer science; knowledge discovery; 02 engineering and technology; Data modeling; massive datasets; open source software; Data visualization; polyglot storage; 020204 information systems; 0202 electrical engineering electronic engineering information engineering; Twitter analysis . Systems; ComputingMilieux_MISCELLANEOUS; business.industry; Polyglot; Inductive reasoning; Data science; Data independence; Data model; 020201 artificial intelligence & image processing; Data architecture; business; Software architecture
researchProduct